Eventually, we will apply it to causality
How are two variables associated?
correlation: allows us to describe the association numerically.
Covariance
\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]
\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]
Variance
Variance is also the covariance of a variable with itself:
\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]
\[Var(X) = \overline{x^2} - \bar{x}^2\]
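As a quick numerical check — a sketch in Python with numpy, using made-up tree measurements (not the dataset below) — both covariance formulas agree, and the variance shortcut follows because variance is the covariance of a variable with itself:

```python
import numpy as np

# Made-up tree measurements (assumed for illustration only)
x = np.array([8.3, 8.6, 8.8, 10.5, 10.7])     # width
y = np.array([70.0, 65.0, 63.0, 72.0, 81.0])  # height

# Definition: average product of deviations from the means (1/n version)
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
# Shortcut: mean of the products minus the product of the means
cov_short = np.mean(x * y) - x.mean() * y.mean()
print(np.isclose(cov_def, cov_short))  # True

# Variance is the covariance of a variable with itself
var_def = np.mean((x - x.mean()) ** 2)
var_short = np.mean(x ** 2) - x.mean() ** 2
print(np.isclose(var_def, var_short))  # True
```

Note these use the \(1/n\) (population) versions from the formulas above; built-ins like `np.cov` default to \(1/(n-1)\).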
Covariance of tree width and tree height:
## [1] 10.38333
Covariance of tree width and timber volume:
## [1] 49.88812
Why is the second one larger?
Scale of covariance reflects scale of the variables.
Covariance
\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
Pearson Correlation
\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)
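A small check of this formula — illustrative data, assumed — shows that dividing the covariance by the two standard deviations matches numpy's built-in correlation coefficient:

```python
import numpy as np

# Illustrative data (assumed)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

cov = np.mean((x - x.mean()) * (y - y.mean()))
r = cov / (x.std() * y.std())  # np.std defaults to the 1/n version used here

# Matches numpy's built-in correlation coefficient
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # True
```

The \(1/n\) vs \(1/(n-1)\) choice cancels in the ratio, which is why the two agree.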
The correlation coefficient must be between -1 and 1
At -1 or 1, the points are on a single line (with a fixed slope)
A negative value implies an increase in one variable is associated with a decrease in the other.
At 0, the covariance must be 0
Correlation of \((x,y)\) is same as correlation of \((y,x)\)
Generally, values closer to -1 or 1 imply “stronger” association
But:
Which has the greatest association? The least?
What is the distance between two points?
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_n - q_n)^2}\]
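A minimal sketch of the \(n\)-dimensional distance formula, using two assumed points:

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Square root of the summed squared coordinate differences
d = np.sqrt(np.sum((p - q) ** 2))
print(d)  # 5.0

# Same thing as the built-in Euclidean norm of p - q
print(np.isclose(d, np.linalg.norm(p - q)))  # True
```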
The mean minimizes the variance.
Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.
\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]
A matrix is a rectangular array of numbers. \(X\) below is a \(3 \times 2\) matrix: \(3\) rows and \(2\) columns.
\[ X = \begin{pmatrix} 1 & -1 \\ 1 & 2 \\ 1 & 4 \end{pmatrix}\]
Matrices and vectors of the same dimensions can be added or subtracted element-by-element.
For example:
\[\begin{pmatrix}1 \\ 1 \end{pmatrix} + \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix}-1 \\ 4 \end{pmatrix}\]
Vectors are matrices with a single column (or row). Scalars are \(1 \times 1\) matrices.
\[2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}\]
\[0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\]
\(a = 2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}; \ b = 0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\)
To multiply two matrices, their dimensions must match: the number of columns of the first must equal the number of rows of the second.
\[\begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & -1 \\ 1 & 2 \end{pmatrix} =\]
\[\begin{pmatrix} (1 \cdot 1)+(1\cdot1) & (1\cdot-1) + (1 \cdot 2) \\ (-1\cdot1) + (2\cdot1) & (-1\cdot-1) + (2\cdot2) \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}\]
If \(u\) and \(v\) are \(n \times 1\) vectors, their inner product (or dot product) is \(u \bullet v = u' \times v\), where \(u'\) is the transpose of \(u\).
\[u = \begin{pmatrix} 1 \\ -2 \end{pmatrix}; \ v =\begin{pmatrix} 4 \\ 2 \end{pmatrix} \]
\[u \bullet v = \begin{pmatrix} 1 & -2 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = 0\]
When the inner product is \(0\), then \(u\) and \(v\) are orthogonal: \(u \perp v\)
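The two worked examples above — the \(2 \times 2\) product and the orthogonal pair \(u, v\) — can be checked numerically:

```python
import numpy as np

# The 2x2 product worked out above
A = np.array([[1, 1], [-1, 2]])
B = np.array([[1, -1], [1, 2]])
print(A @ B)  # [[2 1]
              #  [1 5]]

# The inner product u . v = u'v; a zero result means u and v are orthogonal
u = np.array([1, -2])
v = np.array([4, 2])
print(u @ v)  # 0
```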
Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.
\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]
We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values of our vector \(y\).
Within the \(n\) dimensional space containing our vector \(y\), we want to pick the best prediction of \(y\) (which is in \(n\) dimensions) that is in one dimension: find a single point on a line.
Projecting this into the same dimensional space as \(y\) produces:
\(\hat{y} \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix}\hat{y} \\ \hat{y} \end{pmatrix}\).
\(\hat{y}\) must be on the blue line. We want to pick the point (prediction) that minimizes the distance to \(y\).
\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\)
can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):
\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)
and another vector \(\mathbf{e}\), which is the difference between the vector of observations and the prediction vector:
\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)
This means our goal is to minimize the length of \(\mathbf{e}\).
How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:
\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]
The minimum length is obtained when the angle between \(\hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\) and \(\mathbf{e}\) is \(90^{\circ}\). That is to say
\(\begin{pmatrix} 1 \\ 1 \end{pmatrix} \perp \mathbf{e}\).
We know that two vectors are orthogonal (\(\perp\)) when their dot product is \(0\), so we can create the following equality and solve for \(\hat{y}\).
\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix}3 & 5 \end{pmatrix} - \begin{pmatrix} \hat{y} & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix}3 & 5 \end{pmatrix} - \hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix} 3 & 5 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) - (\hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) = 0\)
\((8) - 2\hat{y} = 0\)
\(8 = 2\hat{y}\)
\(\hat{y} = 4\)
The same derivation works for a sample of size \(n\):
\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \begin{pmatrix} \hat{y} & \ldots & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)
\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \hat{y}\begin{pmatrix} 1 & \ldots & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)
\((\sum\limits_{i=1}^{n} y_i\cdot1) - \hat{y} \sum\limits_{i=1}^{n} 1 = 0\)
\(\sum\limits_{i=1}^{n} y_i = \hat{y} n\)
\(\frac{1}{n}\sum\limits_{i=1}^{n} y_i = \hat{y}\)
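The projection argument above can be sketched numerically with the worked \((3, 5)\) example — the orthogonality condition produces the mean, and the residual vector really is orthogonal to the vector of 1s:

```python
import numpy as np

# The worked example: y = (3, 5), predicted by y_hat * (1, 1)
y = np.array([3.0, 5.0])
ones = np.ones_like(y)

# The orthogonality condition e . 1 = 0 solves to sum(y)/n -- the mean
y_hat = (y @ ones) / (ones @ ones)
print(y_hat)  # 4.0

# The residual vector is orthogonal to the vector of 1s
e = y - y_hat * ones
print(e @ ones)  # 0.0
```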
… but often we want to know if the mean of something \(Y\) is different across different values of something else \(X\).
To put it another way: the mean of \(Y\) is the \(E(Y)\) (if we are talking about random variables). Sometimes we want to know \(E(Y | X)\)
expectation: because it is about the mean of \(Y\)
conditional: because it is conditioned on different values of \(X\).
function: because \(E(Y) = f(X)\), there is some relationship we can look at between values of \(X\) and \(E(Y)\).
One powerful and simple way is to assume that the conditional expectation function is linear.
That is to say \(E(Y)\) is linear in \(X\). The function takes the form of an equation of a line.
By convention, we write:
\[E(Y) = a + b\cdot X\]
The red line above is the regression line or the fit using least squares.
It closely approximates the conditional mean of son’s height (\(Y\)) across values of father’s height (\(X\)).
How do we obtain this line mathematically?
We can do it the same way we obtained the mean!
Regression works similarly:
Rather than reduce the \(n \times 1\)-dimensional vector \(\mathbf{Y}\) into one dimension (as we did with the mean), we reduce it into \(p\) (number of parameters) dimensional space. This requires more dimensions than can easily be visualized, but we still end up minimizing the distance between our \(n\) dimensional vector \(\mathbf{\hat{Y}}\) and the vector \(\mathbf{Y}\).
Given \(\mathbf{Y}\), an \(n \times 1\) dimensional vector of all values of dependent variables \(Y\) for \(n\) observations
and \(\mathbf{X}\), an \(n \times p\) dimensional matrix (\(p\) independent variables, including an intercept, \(n\) observations)
\(\mathbf{\hat{Y}}\) is an \(n \times 1\) dimensional vector of predicted values (for the mean of Y conditional on X) computed by \(\mathbf{X\beta}\). \(\mathbf{\beta}\) is a \(p \times 1\) vector of parameters that we multiply by \(\mathbf{X}\).
Today we’ll assume there are only two parameters in \(\mathbf{\beta}\): \(a,b\) from \(Y_i = a + b \cdot X_i\), so \(p = 2\)
We want to choose \(\mathbf{\beta}\) (here \(a,b\)) such that the distance between \(\mathbf{Y}\) and \(\mathbf{\hat{Y}}\) is minimized — equivalently, such that the sum of squared residuals is minimized (these are identical conditions).
Like before, the distance is minimized when the vector of residuals \(\mathbf{Y} - \mathbf{\hat{Y}} = \mathbf{e}\) is \(\perp\) to \(\mathbf{X}\)
\(\mathbf{X}'_{p\times n}\mathbf{e}_{n\times1} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}\)
\(\mathbf{X}'(\mathbf{Y} - \mathbf{\hat{Y}}) = 0\)
\(\mathbf{X}'(\mathbf{Y} - \mathbf{X{\beta}}) = 0\)
\(\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X{\beta}}\)
\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)
here \(\mathbf{X{\beta}} = a + b \cdot X\)
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\]
This is the matrix formula for least squares regression. If \(X\) is a column vector of 1s, \(\beta\) is just the mean of \(Y\). This is in fact identical to what we solved for earlier.
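A sketch of the matrix formula in action, with simulated data (the true line \(y = 2 + 3x\) plus noise is an assumption for illustration). It also confirms the special case: when \(X\) is only a column of 1s, \(\beta\) is the mean of \(Y\).

```python
import numpy as np

# Simulated data (assumed for illustration): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

# Design matrix: a column of 1s (the intercept) plus the predictor
X = np.column_stack([np.ones_like(x), x])

# beta = (X'X)^{-1} X'Y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)  # roughly [2, 3]

# If X is only the column of 1s, beta collapses to the mean of Y
X1 = np.ones((len(y), 1))
beta1 = np.linalg.inv(X1.T @ X1) @ X1.T @ y
print(np.isclose(beta1[0], y.mean()))  # True
```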
But we also want to know more intuitively what these matrix operations are doing! It isn’t magic.
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]
\[\mathbf{X}'\mathbf{X} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\]
\[= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]
\[\mathbf{X}'\mathbf{Y} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\] \[= \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} = n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]
How do we get \(^{-1}\)? This is inverting a matrix.
\[A \times A^{-1} = A^{-1} \times A = I_{p \times p} = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & \ddots & \ldots & 0 \\ 0 & \ldots & \ddots & 0 \\ 0 & \ldots & 0 & 1 \end{pmatrix}\]
This is an identity matrix with 1s on diagonal, 0s everywhere else.
First, we need the determinant.
For sake of ease, will show for a scalar and for a \(2 \times 2\) matrix:
\[det(a) = a\]
\[det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - cb\]
Then we need to get the adjoint. It is the transpose of the matrix of cofactors (don’t ask me why):
\[adj(a) = 1\]
\[adj\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
The inverse of \(A\) is \(adj(A)/det(A)\)
\[A^{-1} = \frac{1}{ad - cb}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
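A quick check of the \(adj(A)/det(A)\) recipe against numpy's built-in inverse, using an arbitrary invertible \(2 \times 2\) matrix (the matrix itself is an assumption):

```python
import numpy as np

A = np.array([[1.0, 1.0], [-1.0, 2.0]])  # any invertible 2x2 matrix

# Build adj(A)/det(A) by hand for the 2x2 case
(a, b), (c, d) = A
det = a * d - c * b
A_inv = np.array([[d, -b], [-c, a]]) / det

print(np.allclose(A_inv, np.linalg.inv(A)))  # True
print(np.allclose(A @ A_inv, np.eye(2)))     # True: A times its inverse is I
```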
\[(\mathbf{X}'\mathbf{X}) = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]
\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{n}{n^2(\overline{x^2} - \overline{x}^2)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]
\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]
We can put it together to get: \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix} n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x}\ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]
\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]
\[b = \frac{\overline{xy} - \overline{x} \ \overline{y}}{Var(x)} = \frac{Cov(x,y)}{Var(x)}\]
\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]
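Both identities for the slope can be verified numerically — a sketch with simulated data (the generating line is an assumption):

```python
import numpy as np

# Illustrative data (assumed); any x with positive variance works
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)

cov = np.mean(x * y) - x.mean() * y.mean()  # Cov(x, y) shortcut
var = np.mean(x ** 2) - x.mean() ** 2       # Var(x) shortcut
b = cov / var                               # slope as Cov/Var

# Same slope via the correlation: b = r * SD_y / SD_x
r = cov / (x.std() * y.std())
print(np.isclose(b, r * y.std() / x.std()))  # True
```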
\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]
\[a = \frac{\overline{x^2}\overline{y} -\overline{x} \ \overline{xy}}{Var(x)} = \frac{(Var(x) + \overline{x}^2)\overline{y} - \overline{x}(Cov(x,y) + \overline{x}\overline{y})}{Var(x)}\]
\[= \frac{Var(x)\overline{y} + \overline{x}^2\overline{y} - \overline{x}^2\overline{y} - \overline{x}Cov(x,y)}{Var(x)}\]
\[= \overline{y} - \overline{x}\frac{Cov(x,y)}{Var(x)}\]
\[a = \overline{y} - \overline{x}\cdot b\]
Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.
There are other ways to derive least squares.
The mathematical procedures we use in regression ensure that:
the mean of the residuals is always zero. Because we included an intercept (\(a\)) and the regression line goes through the point of averages, \(\overline{e} = 0\). This is also true of the residuals from the mean.
\(Cov(X,e) = 0\). This is true by definition of how we derived least squares. We chose \(\beta\) (\(a,b\)) such that \(X'e = 0\) so they would be orthogonal. \(X'e = 0 \to \overline{xe}=0\); \(\overline{e}=0\) from above; so \(Cov(X,e) = \overline{xe}-\overline{x}\overline{e} = 0 - \overline{x}0 = 0\).
These two facts are unrelated to assumptions we will make later for statistical and causal inference. They are mathematical truths about regression/least squares.
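Both facts can be confirmed on any simulated dataset (the data below are an assumption; the two properties hold regardless):

```python
import numpy as np

# Simulated data (assumed): any line plus noise will do
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 4.0 - 1.5 * x + rng.normal(size=200)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)  # least squares with an intercept
e = y - X @ beta                          # residuals

# Fact 1: the residuals average to zero
print(np.isclose(e.mean(), 0.0))                              # True
# Fact 2: Cov(x, e) = mean(xe) - mean(x)mean(e) = 0
print(np.isclose(np.mean(x * e) - x.mean() * e.mean(), 0.0))  # True
```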
We can fit a regression line to any scatterplot (regardless of how sensical it is), if \(x\) has positive variance.
The regression line minimizes the sum of squared residuals (minimizes the distance between predicted values and actual values of \(y\)). This is why it is called “least squares”
Residuals \(e\) are always uncorrelated with \(x\) if there is an intercept, because they are orthogonal to \(x\) and have mean \(0\).